In [1]:
%pylab inline
import matplotlib.pyplot as plt
import networkx as nx
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
import urllib2
pd.options.display.mpl_style = 'default'
This notebook is going to explore a method of text featurization that I will call nframe.
Many basic natural language processing techniques depend on a "bag of words" document model. This model strips away all syntactic information from a document and looks only at term frequency. A term can be a single word or several adjacent words; this class of features is referred to collectively as n-grams. Bag-of-n-grams models are widely used for applications like sentiment analysis, classification, and topic modelling. They are a useful representation for these applications because they represent documents as vectors and corpora as matrices, data structures that computers are fast at processing.
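To make that concrete, here is a minimal sketch of the bag-of-n-grams representation on a made-up two-document corpus (the example text and the variable names toy_corpus, toy_vectorizer, and toy_dtm are invented for illustration):
In [ ]:
# A made-up corpus, just to illustrate the bag-of-n-grams representation.
toy_corpus = ["the turtle sat on the stone",
              "the turtle king sat on the turtles"]
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
toy_dtm = toy_vectorizer.fit_transform(toy_corpus)
print toy_vectorizer.get_feature_names()   # the n-gram "terms"
print toy_dtm.toarray()                    # one row per document, one column per term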
I am interested in using NLP to compare the meanings of different documents. So I have been working on a "new" technique. Likely, it is not really new and I have just not done the requisite amount of literature review. Reconciling this work with existing literature on e.g. skip-gram featurization remains to be done.
To start, let's get a book with several chapters. For simplicity, I have picked Yertle the Turtle, by Dr. Seuss, treating each stanza as a separate chapter. I encourage you to try this on your own favorite freely available book. Longer books will take more time to process and the results will be harder to interpret.
In [2]:
book_url = "http://www.spunk.org/texts/prose/sp000212.txt"
book = urllib2.urlopen(book_url).read()
print book[:1250]
We are going to nframe each chapter of the book. Look at the text of the book and see if you can split it into chapters using basic Python tools.
In [3]:
# This is a regular expression for the chapter divides in the book
chapter_divide_re = "\n\n"
# We need to clear out the source text's hard line breaks before further processing
to_remove = "\n"
def chapters(book):
    chapters = re.split(chapter_divide_re, book)
    return [re.sub(to_remove, " ", c) for c in chapters]

for i,c in enumerate(chapters(book)):
    print "Chapter " + str(i)
    print c[:200]
    print
When we nframe a chapter, we split it into many sub-documents in a principled way. For example, we can look at the many individual sentences in a chapter. This preserves some of the syntactic information for the chapter without requiring a potentially computationally or conceptually burdensome syntax parser. There's a lot of latitude in how you separate a chapter into sentences. I'm going to do it in a simple way.
In [4]:
# The regular expression that we will split sentences on.
# This could get more complicated if you're more sensitive
# to quotation marks, for example.
sentence_divide_re = "[\.\!\?]"
def sentences(chapters):
    return [re.split(sentence_divide_re, c) for c in chapters]

for i,ss in enumerate(sentences(chapters(book))):
    print "Chapter " + str(i) + ", " + str(len(ss)) + " sentences"
    for s in ss[:3]:  # show the first three sentences of each chapter
        print s
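As an aside, nltk (imported at the top but not otherwise used here) ships a trained sentence tokenizer that handles abbreviations and quotation marks more carefully than the regex above. Here is a sketch of swapping it in, assuming the punkt model has already been fetched with nltk.download('punkt'):
In [ ]:
# An alternative sentence splitter using nltk's punkt tokenizer.
# Assumes nltk.download('punkt') has been run once on this machine.
def nltk_sentences(chapters):
    return [nltk.sent_tokenize(c) for c in chapters]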
Next, for each chapter, we are going to create a document-term matrix whose rows are the chapter's sentences. This is just the old bag of words model everybody is using, and there are lots of open software packages that do this work.
In [5]:
# see here for how to add lemmatization to the tokenizer
# http://scikit-learn.org/stable/modules/feature_extraction.html
vectorizer = CountVectorizer(
    # token_pattern=r'\b\w+\b',
    # min_df=2,
    # max_df=0.6,
    binary=True,
    stop_words='english',
)
# For now, each chapter will have its own feature-index mapping.
# This may make comparison difficult later on.
models = [(vectorizer.fit_transform(ss), vectorizer.get_feature_names())
          for ss in sentences(chapters(book))]
# Note that the model is a sparse matrix and each column of the matrix
# is associated with a single word term.
print models[4]
Here's the big idea. Now we're going to turn the document-term matrix of sentences into one term cooccurrence matrix for each chapter.
Word on the street is that cooccurrence matrices have been used by Google and others to represent word semantics and do amazing things. Or maybe cooccurrence is a red herring and the real magic is in deep learning inspired systems like word2vec. Since deep learning is complicated and hard to interpret, I'm using cooccurrence matrices (and the implied semantic networks) as a poor man's proxy for a document-specific vector representation of the meaning of each word.
Hey I just met you, and this is crazy; but here's my number so call me maybe.
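Before running this on the book, here is why the matrix product works. With binary=True, each row of the sentence-term matrix is a 0/1 indicator of which terms appear in that sentence, so entry (i, j) of dtm.T * dtm counts the sentences in which terms i and j occur together, and the diagonal counts the sentences each term occurs in. A made-up two-sentence, three-term sketch (toy is an invented example; np comes from the pylab import at the top):
In [ ]:
# A made-up binary sentence-term matrix: 2 sentences, 3 terms.
toy = np.array([[1, 1, 0],   # sentence 0 contains terms 0 and 1
                [1, 0, 1]])  # sentence 1 contains terms 0 and 2
# Entry (i, j) counts sentences containing both term i and term j;
# the diagonal counts sentences containing term i.
print toy.T.dot(toy)
# expected:
# [[2 1 1]
#  [1 1 0]
#  [1 0 1]]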
In [6]:
models[4][0].T * models[4][0]
Out[6]:
In [7]:
# compute a term cooccurrence matrix from a document-term matrix
def cooccurences(dtm):
    return dtm.T * dtm

nframes = [(cooccurences(dtm), f) for dtm, f in models]
Now we have what we need to nframe a corpus of chapters. Each nframing is a purely descriptive matrix of within-document word cooccurrences. It takes all the same parameters as the standard bag of words model (size of n-gram, stop word reduction, stemming, lemmatization), plus an additional parameter: the sentence splitting function. You can experiment with different parameters using this notebook.
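For instance, a bigram nframing could be built by swapping in a differently configured vectorizer. This is just a sketch; the names bigram_vectorizer, bigram_models, and bigram_nframes are made up here:
In [ ]:
# A sketch of re-running the pipeline with different parameters:
# unigrams plus bigrams, keeping the binary counts and English stop word list.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True, stop_words='english')
bigram_models = [(bigram_vectorizer.fit_transform(ss), bigram_vectorizer.get_feature_names())
                 for ss in sentences(chapters(book))]
bigram_nframes = [(cooccurences(dtm), f) for dtm, f in bigram_models]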
Let's visualize one of these nframings.
In [8]:
def nframe_pcolor(coc, feature_names):
    coc = coc.toarray()
    plt.pcolor(coc, cmap=matplotlib.cm.Greys)
    # These lines would show the term labels on the axes, but that gets messy
    #plt.yticks(np.arange(0.5, len(feature_names), 1), feature_names)
    #plt.xticks(np.arange(0.5, len(feature_names), 1), feature_names)
    plt.axis([0, len(feature_names), 0, len(feature_names)])
In [9]:
plt.figure(1,figsize=(12, 12))
nframe_pcolor(*nframes[3])
plt.show()
An nframing can also be read as an adjacency matrix between terms. This is a lot like the semantic network, a model for representing meaning that has been used in cognitive psychology for many years. A semantic network represents how strongly different words are associated with each other, as a network with weighted edges. One way to think about nframe is that it creates a semantic network specific to each chapter, or document, in the corpus.
Let's visualize one of the nframings as a semantic network.
In [10]:
def nframe_to_semantic_network(coc, feature_names):
    coc = coc.toarray()
    G = nx.from_numpy_matrix(coc)
    # the diagonal of the cooccurrence matrix counts the sentences each term occurs in
    occurence = dict([(i, float(coc[i,i])) for i in range(coc.shape[0])])
    nx.set_node_attributes(G, 'occurence', occurence)
    G = nx.relabel_nodes(G, dict(enumerate(feature_names)), copy=True)
    return G
In [11]:
G = nframe_to_semantic_network(*nframes[3])
In [12]:
def occ_to_size(occ):
    # scale occurrence counts to reasonable node sizes
    return log(occ + 1) * 300

def draw_semantic_network(G):
    pos = nx.graphviz_layout(G, prog='neato')
    # positive value nodes (green)
    plus_nodes = [u for u,d in G.nodes(data=True) if d['occurence'] >= 0]
    plus_sizes = [occ_to_size(G.node[u]['occurence']) for u in plus_nodes]
    nx.draw_networkx_nodes(G, pos, nodelist=plus_nodes, node_size=plus_sizes, node_color='#CCFFCC')
    # negative value nodes (red)
    neg_nodes = [u for u,d in G.nodes(data=True) if d['occurence'] < 0]
    neg_sizes = [occ_to_size(-G.node[u]['occurence']) for u in neg_nodes]
    nx.draw_networkx_nodes(G, pos, nodelist=neg_nodes, node_size=neg_sizes, node_color='#FFCCCC')
    # positive value edges
    plus_edges = [(u,v) for u,v,d in G.edges(data=True) if d['weight'] >= 0]
    plus_widths = [G[u][v]['weight'] for u, v in plus_edges]
    nx.draw_networkx_edges(G, pos, edgelist=plus_edges, width=plus_widths, edge_color='g', alpha=0.8)
    # negative value edges
    neg_edges = [(u,v) for u,v,d in G.edges(data=True) if d['weight'] < 0]
    neg_widths = [-G[u][v]['weight'] for u, v in neg_edges]
    nx.draw_networkx_edges(G, pos, edgelist=neg_edges, width=neg_widths, edge_color='r', alpha=0.8)
    nx.draw_networkx_labels(G, pos);
In [13]:
plt.figure(1,figsize=(10, 10))
draw_semantic_network(G)
Can we use nframing to compare the meaning of different documents? I hope so! Otherwise, this is a big waste of time.
Let's experiment with some ways of using nframings to look at the differences between two chapters. Can you tell from these visualizations what these two chapters are specifically about? Are there concepts or characters represented by the same word that are different in each chapter?
In [14]:
two_chapters = (3,13)
fig = plt.figure(120,figsize=(15, 8))
plt.subplot(121)
nframe_pcolor(*nframes[two_chapters[0]])
plt.subplot(122)
nframe_pcolor(*nframes[two_chapters[1]])
plt.show()
In [15]:
G2 = nframe_to_semantic_network(*nframes[13])
In [16]:
plt.figure(120,figsize=(15, 8))
plt.subplot(121)
draw_semantic_network(nframe_to_semantic_network(*nframes[two_chapters[0]]))
plt.subplot(122)
draw_semantic_network(nframe_to_semantic_network(*nframes[two_chapters[1]]))
plt.show()
To test whether these kinds of visualizations are useful, we are going to construct something called a semantic network diff.
You might be familiar with the idea of a 'diff' in the context of text or code. In that context, a 'diff' shows you which lines have changed between two versions of a document.
Adapting this idea to semantic networks, here we look at the differences between two networks: which nodes and edges one network has that the other does not, and how the occurrence counts and edge weights differ where the networks overlap.
In [17]:
def semantic_network_difference(g1, g2):
    g3 = g1.copy()
    for node, data in g2.nodes(data=True):
        if node in g1:
            g3.node[node]['occurence'] = g3.node[node]['occurence'] - data['occurence']
        else:
            g3.add_node(node, occurence=-data['occurence'])
    for u, v, data in g2.edges(data=True):
        if g1.has_edge(u, v):
            g3[u][v]['weight'] = g3[u][v]['weight'] - data['weight']
        else:
            g3.add_edge(u, v, weight=-data['weight'])
    return g3
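Note the sign convention: occurrence counts and edge weights that are larger in the first network come out positive (drawn green by draw_semantic_network), while material that is stronger in, or unique to, the second network comes out negative (drawn red). Here is a tiny made-up example; the graphs ga and gb are invented for this sketch:
In [ ]:
# Two tiny hand-built "semantic networks" to illustrate the sign convention.
ga = nx.Graph()
ga.add_node('turtle', occurence=3)
ga.add_node('king', occurence=1)
ga.add_edge('turtle', 'king', weight=2)
gb = nx.Graph()
gb.add_node('turtle', occurence=1)
gb.add_node('mud', occurence=2)
print semantic_network_difference(ga, gb).nodes(data=True)
# expected: turtle -> 2 (stronger in ga), king -> 1 (only in ga), mud -> -2 (only in gb)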
Now we can walk through the whole book and look at the diff for each pair of consecutive chapters.
In [18]:
def chapter_diff(chx, chy):
    gx = nframe_to_semantic_network(*nframes[chx])
    gy = nframe_to_semantic_network(*nframes[chy])
    gz = semantic_network_difference(gx, gy)
    return gz
In [19]:
for n in range(len(chapters(book)) - 1):
    plt.figure(120, figsize=(13, 5))
    print chapters(book)[n]
    gd = chapter_diff(n, n+1)
    draw_semantic_network(gd)
    plt.show()
    #print chapters(book)[n+1]
This notebook was created by Sebastian Benthall for the D-Lab at UC Berkeley.
In [ ]: